2 Description of the data - EDA
We have data from three buildings located in Vienna, each equipped with solar panels. Their sensors record two quantities: the energy produced, in kW, and the sun radiation. The data were collected between 9 August 2016 and 1 July 2019 at 15-minute intervals.
In addition, we acquired weather data from https://www.worldweatheronline.com/developer/api/docs/historical-weather-api.aspx using the Python script importWeatherData.py attached in the folder. The past weather API lets us retrieve weather data for a specified time period and location, and it also supports retrieving data for multiple locations at once. In our case we only needed data for Vienna, because all three buildings are located there. We retrieved the data at a 24-hour (daily) resolution because it was the most convenient time range to work with.
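Because the weather data is daily while the sensor data arrives every 15 minutes, the two sources have to be brought to a common resolution before they can be combined. The following is a minimal sketch of one way to do that, assuming a daily sum is the desired aggregate. The toy tables, the column names `date` and `sun_hours`, and all values are made up for illustration and are not taken from vienna.csv.

```r
library(data.table)

# Toy 15-minute series standing in for one building (two full days, constant 0.5).
energy <- data.table(
  timestamp = seq(as.POSIXct("2016-08-09 00:00:00", tz = "UTC"),
                  by = "15 min", length.out = 192),
  energy_produced = 0.5
)

# Toy daily weather table; `date` and `sun_hours` are hypothetical column names.
weather_daily <- data.table(
  date = as.Date(c("2016-08-09", "2016-08-10")),
  sun_hours = c(11.2, 9.8)
)

# Collapse the 15-minute readings to one row per calendar day.
energy_daily <- energy[, .(energy_produced = sum(energy_produced)),
                       by = .(date = as.Date(timestamp, tz = "UTC"))]

# Attach the daily weather to the daily energy totals by date.
daily <- merge(energy_daily, weather_daily, by = "date")
print(daily)
```

Whether a sum or a mean is the right daily aggregate depends on the quantity: a sum suits energy totals, while a mean is usually more natural for radiation.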
2.1 Setup
# loading the libraries
suppressPackageStartupMessages({
library(ggplot2)
library(data.table)
library(forecast)
library(tidyr)
library(lubridate)
library(dplyr)
library(tseries)
library(plotly)
library(nortest)
library(astsa)
})
Loading the data
# 'sun' - sensor data, otherwise the energy produced
building_2_sun <- readRDS("data/Building 2 sun.rds")
building_2 <- readRDS("data/Building 2.rds")
building_5_sun <- readRDS("data/Building 5 sun.rds")
building_5 <- readRDS("data/Building 5.rds")
building_8_sun <- readRDS("data/Building 8 sun.rds")
building_8 <- readRDS("data/Building 8.rds")
setDT(building_2_sun)
setDT(building_2)
setDT(building_5_sun)
setDT(building_5)
setDT(building_8_sun)
setDT(building_8)
setnames(building_2_sun, "1302611", "sun")
setnames(building_2, "1490017", "energy_produced")
setnames(building_5_sun, "1328370", "sun")
setnames(building_5, "1328347", "energy_produced")
setnames(building_8_sun, "1302169", "sun")
setnames(building_8, "1498763", "energy_produced")
# weather data
weather <- fread("data/vienna.csv", na.strings = c("No moonrise", "No moonset"))
2.2 Sun radiation
columns <- c("building_2_sun", "building_5_sun", "building_8_sun")
plots <- lapply(columns,
function(col) {
plot_ly(data = get(col)[, .(sun=mean(sun)), by=.(timestamp=floor_date(timestamp, "weeks"))],
x = ~timestamp,
y = ~sun,
type = "scatter",
mode = "lines") %>%
layout(yaxis = list(title = paste("Sun radiation on", gsub("_sun", "", col), sep="\n")),
xaxis = list(showticklabels=T),
showlegend=F,
title="Weekly average sun radiation measured on each building")})
subplot(plots, titleY = T, nrows = 3)
The sun radiation measured on each building is quite similar, which suggests that the buildings are in close proximity to each other.
Sun radiation also peaks in the middle of summer and is lowest in winter, as expected.
2.3 Energy production
columns <- c("building_2", "building_5", "building_8")
plots <- lapply(columns,
function(col) {
plot_ly(data = get(col)[, .(energy_produced=sum(energy_produced)), by=.(timestamp=floor_date(timestamp, "weeks"))],
x = ~timestamp,
y = ~energy_produced,
type = "scatter",
mode = "lines") %>%
layout(yaxis = list(title = paste("Energy from", gsub("_", " ", as.character(col)), sep="\n")),
xaxis = list(showticklabels=T),
showlegend=F,
title="Weekly aggregated energy production of each building")})
subplot(plots, titleY = T, nrows = 3)
As we can see, the data from building 2 deviates from that of buildings 5 and 8 because it contains some very extreme values, so we will take a closer look at the outliers there.
As with sun radiation, the most energy is produced in summer and the least in winter, which suggests a correlation between the two features.
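The correlation suggested here can be quantified by joining the two series on timestamp and computing Pearson's r. The sketch below is only an illustration: the helper name `sun_energy_cor` and the toy values are assumptions, not part of the original analysis.

```r
# Join a sun table and an energy table on `timestamp` and return Pearson's r.
# Rows whose timestamp is missing from either table are dropped by the join.
sun_energy_cor <- function(sun_df, energy_df) {
  joined <- merge(sun_df, energy_df, by = "timestamp")
  cor(joined$sun, joined$energy_produced)
}

# Toy data: energy rises linearly with radiation.
ts <- seq(as.POSIXct("2016-08-09 10:00:00", tz = "UTC"),
          by = "15 min", length.out = 4)
toy_sun <- data.frame(timestamp = ts, sun = c(0, 100, 200, 300))
toy_energy <- data.frame(timestamp = ts, energy_produced = c(0, 1, 2, 3))

r <- sun_energy_cor(toy_sun, toy_energy)
print(r)
```

With the tables loaded in the Setup section, the call would look like `sun_energy_cor(building_5_sun, building_5)`, assuming both tables share the same timestamps.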
2.4 Summary of all the datasets
Building 2
summary(building_2_sun[,sun])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##   -0.0867    0.0000    0.1733  130.8078  176.2042 1089.6600
summary(building_2[,energy_produced])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -32716.85      0.00      0.00      0.49      0.55  32716.85
Building 5
summary(building_5_sun[,sun])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   -1.056    1.869   12.919  146.211  180.131 1286.025
summary(building_5[,energy_produced])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    0.000    0.000    0.000    1.591    2.032   18.560
Building 8
summary(building_8_sun[,sun])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    3.354    3.519    3.675  132.371  177.771 1068.678
summary(building_8[,energy_produced])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   0.0000   0.0000   0.0000   0.3009   0.3760   2.3920
As you can see, the range of energy_produced in building_2 (from -32716.85 to 32716.85) differs by several orders of magnitude from that of the other two buildings (0 to 18.56 and 0 to 2.392), even though its median and mean are comparable to theirs.
Let us count NAs if there are any:
sum(is.na(building_2_sun))
## [1] 0
sum(is.na(building_2))
## [1] 0
sum(is.na(building_5_sun))
## [1] 0
sum(is.na(building_5))
## [1] 0
sum(is.na(building_8_sun))
## [1] 0
sum(is.na(building_8))
## [1] 0
No NAs found in any of the datasets!
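Note that `sum(is.na(...))` only detects explicit NA values. An interval that was never recorded shows up as an absent row instead. The sketch below shows one possible check for such gaps, assuming the nominal 15-minute sampling step described earlier. The helper name `find_gaps` and the toy timestamps are hypothetical.

```r
# Return the start and length (in minutes) of every gap longer than the
# nominal sampling step in a vector of timestamps.
find_gaps <- function(timestamp, step_mins = 15) {
  ts <- sort(timestamp)
  step <- as.numeric(diff(ts), units = "mins")
  is_gap <- step > step_mins
  data.frame(gap_start = ts[-length(ts)][is_gap], gap_mins = step[is_gap])
}

# Toy timestamps with the 00:45 and 01:00 readings missing.
toy_ts <- as.POSIXct(c("2016-08-09 00:00:00", "2016-08-09 00:15:00",
                       "2016-08-09 00:30:00", "2016-08-09 01:15:00"),
                     tz = "UTC")
gaps <- find_gaps(toy_ts)
print(gaps)
```

On the real tables this would be called as, for example, `find_gaps(building_2$timestamp)`.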
Now let us look at some boxplots, which can reveal potential outliers:
par(mfrow=c(2,3))
boxplot(building_2[,energy_produced], main=" Building 2 energy")
boxplot(building_5[,energy_produced], main=" Building 5 energy")
boxplot(building_8[,energy_produced], main=" Building 8 energy")
boxplot(building_2_sun[,sun], main="Building 2 sun")
boxplot(building_5_sun[,sun], main="Building 5 sun")
boxplot(building_8_sun[,sun], main="Building 8 sun")
All datasets show a number of outliers, mostly above the upper whisker.
Now let us look at the distributions of the data:
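The points a boxplot draws as outliers can also be counted directly with base R's `boxplot.stats`, which applies the same 1.5 x IQR whisker rule that `boxplot` uses by default. The following is a small sketch on toy values; the helper name `count_boxplot_outliers` is an assumption made for illustration.

```r
# Count the observations that fall below the lower whisker or above the
# upper whisker, using the default boxplot rule (coef = 1.5).
count_boxplot_outliers <- function(x) {
  s <- boxplot.stats(x)
  c(below = sum(s$out < s$stats[1]),
    above = sum(s$out > s$stats[5]))
}

# Toy vector: mostly small values and one extreme reading.
toy <- c(0, 0, 0, 0.5, 1, 1.5, 2, 50)
res <- count_boxplot_outliers(toy)
print(res)
```

Applied to the real data, this would be called as, for example, `count_boxplot_outliers(building_2[, energy_produced])`.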
a <- density(building_2[,energy_produced])
b <- density(building_2_sun[,sun])
c <- density(building_5[,energy_produced])
d <- density(building_5_sun[,sun])
e <- density(building_8[,energy_produced])
f <- density(building_8_sun[,sun])
fig1 = plot_ly(x=a$x, y=a$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Energy produced of building 2")
fig2 = plot_ly(x=b$x, y=b$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Sun radiation of building 2")
fig3 = plot_ly(x=c$x, y=c$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Energy produced of building 5")
fig4 = plot_ly(x=d$x, y=d$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Sun radiation of building 5")
fig5 = plot_ly(x=e$x, y=e$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Energy produced of building 8")
fig6 = plot_ly(x=f$x, y=f$y, type= "scatter", mode = "lines", fill = "tozeroy", name="Sun radiation of building 8")
fig <- subplot(fig1, fig2, fig3, fig4, fig5, fig6, nrows = 6)
fig
The distributions do not look normal. The summaries above show medians far below the means, so most of the mass sits near zero with a long right tail, which matches the extreme values we saw in the boxplots. Also, the energy produced by building 2 spans a very different range from the others, which is suspicious.
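The shape of a distribution can also be checked numerically rather than by eye, for instance by comparing the mean with the median and computing the sample skewness. For a roughly normal sample the mean and median are close and the skewness is near zero. The sketch below uses base R only; the helper name `sample_skewness` and the toy vector are assumptions for illustration.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment.
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy vector resembling the sensor data: many zeros and a few large values.
toy <- c(0, 0, 0, 0, 1, 2, 10)
skew <- sample_skewness(toy)
print(c(mean = mean(toy), median = median(toy), skewness = skew))
```

A clearly positive skewness together with a mean well above the median points to a right-skewed distribution.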
Plotting sun radiation vs energy produced on one plot
building_5_sun %>%
plot_ly(
x=~timestamp,
y=~sun,
type="scatter",
mode="lines",
name="sun radiation",
line = list(color='#ff7f0e')
) %>%
add_trace(
inherit = F,
data=building_5,
x=~timestamp,
y=~energy_produced,
type="scatter",
mode="lines",
name="energy produced",
yaxis = "y2",
line = list(color = '#1f77b4')
) %>%
layout(
title = "Building 5",
yaxis2 = list(
tickfont = list(color = '#1f77b4'), # match the colour of the energy trace
overlaying = "y",
side = "right",
title = "second y axis - energy"
)
)